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A quick overview of the Top 10 landscape 




1. Jaguar - Cray XT5-HE Opteron Six Core 2.6 GHz
2. Nebulae - Dawning TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU
3. Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband
4. Kraken XT5 - Cray XT5-HE Opteron Six Core 2.6 GHz
5. JUGENE - Blue Gene/P Solution
6. Pleiades - SGI Altix ICE 8200EX/8400EX, Xeon HT QC 3.0/Xeon Westmere 2.93 GHz, Infiniband
7. Tianhe-1 - NUDT TH-1 Cluster, Xeon E5540/E5450, ATI Radeon HD 4870 2, Infiniband
8. BlueGene/L - eServer Blue Gene Solution
9. Intrepid - Blue Gene/P Solution
10. Red Sky - Sun Blade x6275, Xeon X55xx 2.93 GHz, Infiniband
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The top 10 by type 


• These systems range in cost from $100M up
• No, it can't be done with Google clusters


• 5 COTS clusters @ 2, 3, 6, 7, 10
• 2 part-custom (Cray XT-5) @ 1, 4
• 3 full-custom (Blue Gene) @ 5, 8, 9
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y Cluster Example: RoadRunner 


[Figure: Roadrunner node, memory, and the parallel computing network (InfiniBand DDR)]
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XT[4,5,6] 


[Figure: Cray XT node diagrams]

Current dual-core Opteron: Core 1 and Core 2 at 2.8 GHz, 1 MB L2 per core
Quad-core Opteron phase: 512 KB L2 per core, shared 2 MB L3

Both nodes: system request interface & crossbar switch, memory controller, HyperTransport links
• Memory controller: to main memory (DIMMs; one configuration shown is 4x2GB DIMMs)
• HyperTransport: to SeaStar interconnect (communication with off-node processes and IO system)
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e: Blue Gene/L CPU 




Cabinet
Node board (32 chips, 4x4x2)
Compute card: 5.6/11.2 GF/s, 0.5 GB DDR
Chip: 2.8/5.6 GF/s, 4 MB

Figure 1: BlueGene/L packaging.
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Our part in the landscape 


Background 


• Over the past four years, the HARE project (DOE, IBM, Bell Labs, Vita Nuova) has ported Plan 9 to two of the largest supercomputers in the world

• BG/L and then BG/P

• Research to study the value of Plan 9 in this context

• Plan 9 replaces two existing OSes: IBM's Compute Node Kernel (CNK) and/or Linux
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Total port effort for BG/L 


• 16 man weeks

• How much assembly in Plan 9 kernel? 1033 lines

• How many files in Plan 9 BG/L kernel? About 90, including those auto-generated by config

• 18 are platform-specific, of which we had to modify about 10

• Plan 9 (an OS) is smaller than every MPI library

• BG/P effort was similar

• You can see all our code: http://bitbucket.org/ericvh/hare
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The kernel part was the easy part 


You think the kernel is big? You oughtta see user-mode software!

• A very large system ... DCMF, a “simple” comms library, is 100 KLOC of C++

• The configuration file for openmpi is 150 KLOC

• And let us not forget the other runtime goo such as Python
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The kernel part was the easy part 


• One million lines of Fortran code for one program

• I'm including a lot of library code in this

• Written in GNU C (its own standard) and GNU C++ (its own standard)

• Configured at run time with Python

• A source port to Plan 9 is simply not in the cards

• Consider what it took just to port gcc, and it’s not really working that well yet

• We considered several options, including source-to-source transformation, porting compilers, etc.

• None of these options was workable
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How we can run CNK binaries 


• Next question: how?

• First we need to see what a CNK binary looks like:

1M boundary   Stack
1M boundary   Heap
1M boundary   Data
1M boundary   Code
              Base at 16 MB

Figure: A CNK binary layout

• This suggests an idea

• If we want a Plan 9 “manager” program, it can live in the same address space as the binary itself
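The layout above can be sketched as simple address arithmetic. This is only an illustration: the segment sizes are made up, the names `align_up` and `cnk_layout` are hypothetical, and placing the stack directly above the heap is a simplification of whatever the real CNK loader does.

```python
MB = 1 << 20
CNK_BASE = 16 * MB  # the CNK image starts at 16 MB, per the figure


def align_up(addr, boundary=MB):
    """Round addr up to the next multiple of boundary (a power of two)."""
    return (addr + boundary - 1) & ~(boundary - 1)


def cnk_layout(sizes, base=CNK_BASE):
    """Place code, data, heap and stack bottom-up, each on a 1 MB boundary.

    sizes maps segment name -> size in bytes (hypothetical values).
    Returns segment name -> (start, end) with end exclusive.
    """
    layout = {}
    addr = base
    for name in ("code", "data", "heap", "stack"):
        start = align_up(addr)
        layout[name] = (start, start + sizes[name])
        addr = start + sizes[name]
    return layout


example = cnk_layout({
    "code": 3 * MB + 100,   # not a multiple of 1 MB, so data gets bumped up
    "data": 512 * 1024,
    "heap": 8 * MB,
    "stack": 1 * MB,
})
for name, (start, end) in example.items():
    print(f"{name:5s} {start:#010x} .. {end:#010x}")
```

The point is that nothing in the binary touches addresses below 16 MB, which is what leaves room for a manager program in the same address space.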
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How we can run CNK binaries 


End up looking like this:

1M boundary   Plan 9 Stack
              Stack
1M boundary   Heap
1M boundary   Data
1M boundary   Code
              Dead Zone
              Plan 9 heap
              Plan 9 Data
              Plan 9 Code

Figure: Plan 9 “shepherd” process image
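One way to read the shepherd image is as a map from address to owner. The sketch below assumes the Plan 9 code, data and heap end somewhere below 16 MB and the CNK image starts at exactly 16 MB; the function name `owner` and the example bounds are hypothetical, and the Plan 9 stack at the top of the image is left out for brevity.

```python
MB = 1 << 20
CNK_BASE = 16 * MB  # CNK code starts here; everything below belongs to Plan 9


def owner(addr, plan9_end, cnk_end):
    """Say which part of the shepherd image an address falls in.

    plan9_end: first address past the Plan 9 code/data/heap (hypothetical).
    cnk_end:   first address past the CNK code/data/heap/stack (hypothetical).
    """
    if plan9_end > CNK_BASE:
        raise ValueError("Plan 9 image must fit below the CNK base")
    if addr < 0:
        raise ValueError("negative address")
    if addr < plan9_end:
        return "plan9"
    if addr < CNK_BASE:
        return "dead zone"   # unmapped gap between the two images
    if addr < cnk_end:
        return "cnk"
    return "unmapped"


# Example: a 4 MB Plan 9 shepherd under a CNK image that ends at 30 MB.
for a in (2 * MB, 8 * MB, 16 * MB, 31 * MB):
    print(hex(a), owner(a, plan9_end=4 * MB, cnk_end=30 * MB))
```

Because the two images never overlap, the shepherd and the CNK binary can share one address space without relocation.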
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How we can run CNK binaries 


Other issues ... 


• In LinuxEMU, on x86, INT 80 is invalid and is trapped via notes to a manager

• Not practical on Blue Gene for several reasons

• Only one system call instruction used by Plan 9 and Linux and CNK

• Efficiency issue

• Kernel always knows more than user mode (e.g. it knows physical addresses and user only knows virtual)

• So it really only makes sense to trap in kernel
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Let’s go look at code 


Trapping in kernel 


• There's only one system call type

• So you have to mark the process and switch out on the mark

• OK, let's go look at some code
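Before the real kernel source, here is a toy model of "mark the process and switch on the mark". It is not the HARE code: the `Proc` class, the `cnk` flag and both handler functions are hypothetical stand-ins, written in Python only to show the control flow.

```python
class Proc:
    """Toy process record; the cnk flag plays the role of the mark."""

    def __init__(self, name):
        self.name = name
        self.cnk = False  # False: Plan 9 system calls, True: CNK/Linux ones


def plan9_syscall(proc, num, args):
    # Stand-in for the native Plan 9 system call table.
    return ("plan9", num)


def cnk_syscall(proc, num, args):
    # Stand-in for the CNK/Linux emulation table.
    return ("cnk", num)


def syscall(proc, num, args=()):
    """Single trap entry point.

    All three systems use the same system call instruction, so the
    instruction itself cannot tell us which ABI is meant. The kernel
    looks at the mark on the process instead.
    """
    handler = cnk_syscall if proc.cnk else plan9_syscall
    return handler(proc, num, args)


p = Proc("shepherd")
print(syscall(p, 4))   # handled as a Plan 9 call
p.cnk = True           # flip the mark before entering the CNK binary
print(syscall(p, 4))   # same number, now handled as a CNK call
p.cnk = False          # and flip back again
print(syscall(p, 4))
```

Keeping the mark per process means a single process can alternate between the two system call personalities.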
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Let’s go look at machcnk 


The code that runs the code 


• Called machcnk since it runs libmach

• One interesting thing: you can flip back and forth

• Back into code ...
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Conclusions 




• We can emulate Linux in the Plan 9 kernel

• We can switch back and forth

• The “shepherd” model works on Blue Gene because we know where the binaries live

• On real Linux it’s not nearly this easy

• Unless we recompile all the GNUbin to live above 16M, which is not that hard

• This is one path to efficient, integrated emulation in the Plan 9 kernel
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